Diabetes mellitus (DM), a major public health concern, is associated with obesity, age, race, and gender, and identifying the associated risk factors is crucial for targeted intervention. Logistic regression estimates the association between risk factors and binary outcomes (presence or absence of diabetes). Standard analytical approaches are insufficient for the complexity of healthcare data (DNA sequences, imaging, patient-reported outcomes, electronic health records (EHRs), longitudinal health measurements, diagnoses, and treatments) (Zeger et al., 2020). Moreover, classical maximum likelihood estimation (MLE) yields unstable results in small samples, with missing data, or under quasi- or complete separation. A Bayesian hierarchical model with Markov chain Monte Carlo (MCMC), implemented on multivariate longitudinal healthcare data and integrating prior knowledge, predicts health status (Zeger et al., 2020). The model has two levels of data structure: (1) repeated measures over time within individuals and (2) individuals nested within a population, with added exogenous covariates (e.g., age, clinical history) and endogenous covariates (e.g., current treatment). The posterior and marginal distributions obtained from MCMC estimation of the parameters provide risk prediction (pneumonia, prostate cancer, and mental disorders). The model's limitation is its parametric nature.
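The instability of classical MLE under complete separation is easy to reproduce. Below is a minimal sketch on simulated data (not the study data), showing how the fitted slope diverges when a predictor perfectly classifies the outcome:

```r
# Toy illustration on simulated data: a perfectly separated predictor
# makes classical MLE logistic regression diverge.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)   # y = 1 exactly when x > 3: complete separation
fit <- suppressWarnings(glm(y ~ x, family = binomial))
coef(fit)["x"]             # slope is huge and meaningless instead of converging
```

glm() emits "fitted probabilities numerically 0 or 1 occurred" warnings here, and the reported standard errors are equally unusable, which is the same pathology observed later in the complete-case analysis.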
Application of Bayesian inference. Chatzimichail and Hatjimihail (2023) compared parametric distributions (with a fixed set of parameters) and non-parametric distributions (which make no a priori assumptions) on National Health and Nutrition Examination Survey data from two separate diagnostic tests on diseased and non-diseased populations, providing posterior probabilities for disease classification. Clinical criteria and fixed numerical thresholds in the conventional and dichotomous methods (based on the overlap of probability distributions between the diseased and non-diseased groups) fail to capture the intricate relationship between diagnostic tests and disease prevalence, or the complexity and heterogeneity across diverse populations, and their applicability to skewed, bimodal, or multimodal data is critiqued. The Bayesian non-parametric approach is more flexible, adaptable, versatile, and robust than the parametric one, capturing complex data patterns and producing multimodal probability patterns rather than the bimodal, double-sigmoidal curves of parametric models. Integrating priors, combined with multiple diagnostic tests, improves diagnostic accuracy and precision. Model applicability is limited by access to scholarly publications and over-dependence on priors. Combining the approach with other statistical and computational techniques enhances diagnostic capabilities (Chatzimichail and Hatjimihail, 2023), helping to overcome systemic bias and unrepresentative, incomplete, or non-normal datasets.
Bayesian methodology as reviewed by Schoot et al. (2021) emphasizes the importance of priors, data modeling, inference, model checking, sampling from the posterior distribution, variational inference, and variable selection, with applicability across the social sciences, ecology, genetics, and medicine. Variable selection is crucial because multicollinearity, insufficient sampling, and overfitting result in poor predictive performance and difficult interpretation. Informative, weakly informative, and diffuse priors are incorporated depending on the (un)certainty in the hyperparameters, with a larger variance representing greater uncertainty. Prior elicitation (from experts, generic expert knowledge, previous data, or sample data using maximum likelihood or sample statistics) and prior sensitivity analysis assess how the priors and the likelihood align. Priors provide data-informed shrinkage and regularization, concentrating estimation in a high-density region and improving it. Specifying prior information in small, less informative samples strengthens the observed data: the unknown parameters take varied values, the observed data are fixed, the likelihood function generates a range of possible values, and MCMC algorithms sampling from the resulting distribution through computer simulation provide empirical estimates of the posterior distribution (brms and blavaan in R). The frequentist method does not assign probability to the unknown parameters and treats them as fixed, with the likelihood based on the conditional probability distribution of the data. Spatial and temporal Bayesian models have applicability to large-scale cancer genomic data: identifying novel molecular-level changes and interactions between mutated genes, capturing mutational signatures, allowing genomics-based patient stratification in clinical trials and targeted treatment, and aiding understanding of cancer evolution.
The Bayesian model is reproducible, but is limited by autocorrelation in the temporal model and by subjectivity in prior elicitation.
Prior elicitation, analytical posteriors, robustness checks in Bayesian normal linear regression, and a parametric (conjugate) model incorporating a Normal–Inverse-Gamma prior have been demonstrated in metrology by Klauenberg et al. (2015) to calibrate instruments. In the Gaussian model, errors are independent and identically distributed, the variance is unknown, and the relationship between X and Y is statistical, with noise and model uncertainty, so the regression cannot be treated as a measurement function. Likelihood, Bayesian, bootstrap, and related methods account for uncertainty and prior information; both observables (data) and unobservables (parameters and auxiliary variables) are treated as unknown and random, and the assigned probability distributions update prior knowledge about the unobservables, enhance interpretation, and robustify the analysis. The Normal–Inverse-Gamma (NIG) distribution, the conjugate prior family for unknown mean and variance, can also specify vague or non-informative priors. Hierarchical priors add a further layer of distributions, allowing the uncertainty to be modeled more flexibly.
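Under the standard conjugacy assumptions, the NIG posterior has a closed form. The following is a minimal sketch of the update for a univariate normal sample with unknown mean and variance; the prior hyperparameters (m0, k0, a0, b0) are illustrative defaults, not values from Klauenberg et al.:

```r
# Conjugate Normal-Inverse-Gamma update for x_i ~ N(mu, sigma^2),
# prior: mu | sigma^2 ~ N(m0, sigma^2 / k0), sigma^2 ~ IG(a0, b0).
nig_update <- function(x, m0 = 0, k0 = 1, a0 = 1, b0 = 1) {
  n    <- length(x)
  xbar <- mean(x)
  kn   <- k0 + n
  mn   <- (k0 * m0 + n * xbar) / kn               # posterior mean of mu
  an   <- a0 + n / 2
  bn   <- b0 + 0.5 * sum((x - xbar)^2) +
          k0 * n * (xbar - m0)^2 / (2 * kn)
  list(m = mn, k = kn, a = an, b = bn)
}

# A nearly flat prior on mu (tiny k0) lets the data dominate:
post <- nig_update(c(4.8, 5.1, 5.3, 4.9, 5.2), m0 = 0, k0 = 0.01)
post$m   # pulled almost entirely to the sample mean (5.06)
```

With a vague prior the posterior mean essentially reproduces the sample mean, which is the sense in which conjugate NIG priors can be made non-informative.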
The Bayesian hierarchical / meta-analytic linear regression model of DeLeeuw (2012) augments data by incorporating both exchangeable and unexchangeable information on parameters, addressing the issues of multiple testing with low statistical power, of conducting separate significance tests across studies with different predictors, and of the need for larger samples. Ordinary linear regression produces small, unreliable estimates vulnerable to sampling variation. Priors from meta-analysis in Bayesian regression address the challenges of small sample size and the unavailability of previous articles, resolving the limitations of univariate analyses and the relationships among multiple regression parameters within a study. Priors based on previous and current data are categorized as (1) exchangeable, when the current data and previous studies share the same set of predictors, and (2) unexchangeable, when the predictors differ. The probability density function for the data (sampled using the Gibbs sampler) and the likelihood function reflect prior assumptions about the model. The hierarchical unexchangeable model is applicable to studying differences among studies, enabling explicit testing of the exchangeability assumption. Application is limited by the correlation between identical sets of predictors (DeLeeuw, 2012).
Bayesian logistic regression (Bayesian GLM): a sequential clinical reasoning model. Liu (2013) demonstrated its applicability in screening adults (20–79 years, Taiwan), addressing the limited availability of molecular information and offering an alternative that leverages routinely collected biological markers to classify disease. Predictors were added sequentially in three models: (1) demographic features (basic model), (2) six metabolic components (metabolic score model), and (3) conventional risk factors (enhanced model). Incorporating priors and emulating a clinician's evaluation process, the model assumes normally distributed regression coefficients, accounts for uncertainty in clinical weights, and averages credible intervals for the predicted risk estimates. The posterior distributions produced by the enhanced model showed that patient background contributed significantly to baseline risk estimation, integrating individual characteristics and capturing ecological heterogeneity. The model's applicability is limited by potential interactions between predictors and by the lack of external cross-validation.
Bayesian multiple imputation with logistic regression (Austin, 2021) addresses missing data in clinical research. Analyzing the causes of missing values, (i) patients refusing to answer specific questions, (ii) loss to follow-up, (iii) investigator or mechanical errors, or (iv) physicians choosing not to order certain investigations, and understanding the missingness mechanism, missing at random (MAR), missing not at random (MNAR), or missing completely at random (MCAR), is crucial. Multiple imputation (MI) using R, SAS, or Stata provides plausible values, creating multiple completed datasets, conducting identical statistical analyses across them, and robustifying the estimates through pooled results when classifying patients' health status and mortality rates.
Aims
The present study focuses on the application of Bayesian logistic regression to predict diabetes status based on body mass index (BMI), age, gender, and race as predictors using a retrospective dataset (2013–2014 NHANES survey). The dataset reveals challenges such as quasi-separation, missing values, and a relatively small effective sample size, and the traditional logistic regression has limitations in dealing with these anomalies. Initial data exploration yielded 9,813 observations across five selected variables. The results from complete case analysis, listwise deletion, substantially reduced the sample size to only 14 complete cases and presented quasi-separation with implausibly large coefficients and unstable estimates. The analytic limitations of traditional logistic regression motivate us to perform Multiple Imputation by Chained Equations (MICE) in conjunction with Bayesian logistic regression. The approach could provide a flexible framework for modeling uncertainty, incorporating prior knowledge, and addressing issues related to quasi-separation and limited sample size.
Method and Data Preparation
Statistical tool. R, R packages, and libraries are used to import, manage, and analyze the data. Data are drawn from the NHANES 2-year cross-sectional cycle (2013–2014) using 3 datasets (demographics, exam, questionnaire) (Center for Health Statistics, 1999). The haven package converted the .XPT files into an R dataframe (df).
Subsets created from the original weighted 3 datasets (demographics, exam, questionnaire) were merged using ID into a single dataframe. The merged dataframe included selected variables of interest.
Response variable (Diabetes, binary) was defined from the question "Is there one doctor you see for diabetes".
Predictor variable (Body Mass Index, factor, 4 levels): the original data provide BMDBMIC (measured BMI category) as a categorical variable with the following 4 levels:
o Underweight (<5th percentile)
o Normal (5th–<85th percentile)
o Overweight (85th–<95th percentile)
o Obese (≥95th percentile)
We kept the variable as it is, since the categorization provides clinically interpretable groups; missing values were retained.
Covariates:
Gender (factor, 2 levels): Male, Female
Ethnicity (factor, 5 levels): Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Other Race - Including Multi-Racial
Age (num, continuous)
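The recoding of the raw variables into these labeled factors can be sketched as follows. The raw NHANES column names (RIAGENDR, RIDRETH1, BMDBMIC) and numeric level codes are assumed to match the codebook, and the sketch is shown on a small mock data frame rather than the merged study data:

```r
# Sketch of the factor-recoding step on a mock data frame; raw NHANES
# column names and level codes are assumptions taken from the codebook.
raw <- data.frame(RIAGENDR = c(1, 2), RIDRETH1 = c(3, 4), BMDBMIC = c(2, 4))
raw$Gender <- factor(raw$RIAGENDR, levels = 1:2,
                     labels = c("Male", "Female"))
raw$Race   <- factor(raw$RIDRETH1, levels = 1:5,
                     labels = c("Mexican American", "Other Hispanic",
                                "Non-Hispanic White", "Non-Hispanic Black",
                                "Other Race - Including Multi-Racial"))
raw$BMI    <- factor(raw$BMDBMIC, levels = 1:4,
                     labels = c("Underweight", "Normal weight",
                                "Overweight", "Obese"))
raw[, c("Gender", "Race", "BMI")]
```

Mapping codes to labeled factor levels at import time is what makes the later tables and plots print readable category names instead of integer codes.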
Code
library(gt)
# formation of table with variable details
variables <- c("ID", "BMI", "Age", "Gender", "Race", "PSU", "Strata", "Weight", "Diabetes")
descriptions <- c(
  "Respondent sequence number",
  "BMI calculated as weight in kilograms divided by height in meters squared, and then rounded to one decimal place.",
  "Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of Age.",
  "Gender",
  "Race/ethnicity Recode of reported race and Hispanic origin information",
  "Sample PSU",
  "Sample strata",
  "MEC exam weight",
  "Diabetes status Is there one doctor or other health professional {you usually see/SP usually sees} for {your/his/her} diabetes? Do not include specialists to whom {you have/SP has} been referred such as diabetes educators, dieticians or foot and eye doctors."
)
df <- data.frame(Variable = variables, Description = descriptions)
df %>%
  gt() %>%
  tab_header(title = "Table Variable Description") %>%
  tab_footnote(footnote = "Each variable in the dataset, accompanied by a qualitative description from the study team.")
Table Variable Description
Variable
Description
ID
Respondent sequence number
BMI
BMI calculated as weight in kilograms divided by height in meters squared, and then rounded to one decimal place.
Age
Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of Age.
Gender
Gender
Race
Race/ethnicity Recode of reported race and Hispanic origin information
PSU
Sample PSU
Strata
Sample strata
Weight
MEC exam weight
Diabetes
Diabetes status Is there one doctor or other health professional {you usually see/SP usually sees} for {your/his/her} diabetes? Do not include specialists to whom {you have/SP has} been referred such as diabetes educators, dieticians or foot and eye doctors.
Each variable in the dataset, accompanied by a qualitative description from the study team.
The dataframe structure, missing values, a plot of the data, and the breakdown of missingness reveal that only 14 cases are complete (no NAs in any variable).
Code
# structure of the merged data
str(merged_data)
'data.frame': 9813 obs. of 9 variables:
$ ID : num 73557 73558 73559 73560 73561 ...
$ BMI : Factor w/ 4 levels "Underweight",..: NA NA NA 2 NA NA NA NA NA NA ...
$ Age : num 69 54 72 9 73 56 0 61 56 65 ...
$ Gender : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 1 1 2 2 1 ...
$ Race : Factor w/ 5 levels "Mexican American",..: 4 3 3 3 3 1 3 3 3 3 ...
$ PSU : num 1 1 1 2 2 1 1 1 1 2 ...
$ Strata : num 112 108 109 109 116 111 105 114 112 112 ...
$ Weight : num 13481 24472 57193 55767 65542 ...
$ Diabetes: Factor w/ 2 levels "Yes","No": 1 1 2 NA NA NA NA NA NA NA ...
mean SE
RaceMexican American 0.110450 0.0199
RaceOther Hispanic 0.060637 0.0106
RaceNon-Hispanic White 0.622315 0.0365
RaceNon-Hispanic Black 0.120895 0.0173
RaceOther Race - Including Multi-Racial 0.085704 0.0076
Data preprocessing and cleaning steps
Using library(survey), the weighted means and standard errors of the variables were extracted.
Data variables were categorized and recoded.
Special codes are not random and cannot simply be dropped; if the informative missingness (MAR or MNAR) were ignored, bias could be introduced, so we transformed the special codes into NAs (based on the variable codebook).
All NAs were retained in the dataset; regression functions such as lm() and glm() exclude them automatically (row-wise deletion).
Yes No <NA>
Mexican American 0.8763885 0.4178131 15.8768980
Other Hispanic 0.4891470 0.1528585 8.8352186
Non-Hispanic White 2.0992561 0.5706716 33.3842862
Non-Hispanic Black 1.4878223 0.3668603 20.5441761
Other Race - Including Multi-Racial 0.6827678 0.2140018 14.0018343
Yes No <NA>
Male 2.6903088 0.8865790 45.6537247
Female 2.9450729 0.8356262 46.9886885
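The special-code recoding described in the preprocessing steps can be sketched as follows. The codes 7 (Refused) and 9 (Don't know) are illustrative examples only; the actual special codes come from each variable's codebook:

```r
# Minimal sketch of recoding codebook special codes to NA instead of
# dropping them; the codes 7 and 9 are illustrative assumptions.
diabetes_raw <- c(1, 2, 7, 9, 1, NA)
diabetes_raw[diabetes_raw %in% c(7, 9)] <- NA        # special codes -> NA
Diabetes <- factor(diabetes_raw, levels = c(1, 2), labels = c("Yes", "No"))
table(Diabetes, useNA = "ifany")
```

Recoding to NA (rather than deleting rows) keeps the missingness visible to later steps such as the missingness plots and MICE.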
Code
## Bar plot of Age, gender, race, diabetes status, BMI ##
plot_bar(merged_data, title = "Figure 3 (Merged dataset). Frequency plots of categorical variables.")
Method
The study applied frequentist methods, a multiple logistic regression model, a baseline regression model, and Firth penalized regression, to the NHANES dataset to predict diabetes as a function of BMI, age, race, and gender, and then compared the results from the MLR and Bayesian models.
Bayesian Logistic Regression
Bayesian logistic regression was selected for our data because the study response variable is a binary outcome (Diabetes: yes/no).
Bayesian logistic regression applies Bayes' rule to a binomial probability model.
The model relates the linear predictor formed from Age, Race, BMI, and Gender to the response variable (Diabetes), assuming independent observations.
Regression of a discrete variable taking the two values 0 or 1 uses the Bernoulli probability model to classify the categorical response, here predicting diabetes.
The logit link maps the linear predictor to probabilities for the response variable.
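In R, the logit link and its inverse are available as qlogis() and plogis(). The linear-predictor coefficients below are hypothetical, chosen only to show the mapping between the log-odds scale and the probability scale:

```r
# Logit link: logit(p) = log(p / (1 - p)); inverse: p = 1 / (1 + exp(-eta)).
# The intercept and Age coefficient here are made-up illustration values.
eta <- -1.2 + 0.04 * 50             # hypothetical linear predictor at Age = 50
p   <- plogis(eta)                  # inverse-logit: predicted probability
stopifnot(isTRUE(all.equal(qlogis(p), eta)))   # the two functions invert each other
p
```

This round trip is exactly what glm(..., family = binomial) and brms with bernoulli(link = "logit") do internally when converting coefficients to predicted probabilities.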
Data exploration of the raw dataset
Before running the Bayesian regression, basic statistics, summaries, anomalies, and patterns were reported: descriptive statistics of the variables (counts, frequencies, proportions, means, and standard deviations) and the proportions of the variables were calculated, with visualization using frequency plots for continuous and categorical variables.
Multiple logistic regression on raw dataset
resulted in a small sample size (n = 14) due to listwise deletion of NAs;
showed quasi-separation (van Buuren, 2012).
The data revealed a breakdown of missingness into missing at random and missing not at random (MAR and MNAR), a commonly reported issue in healthcare and public health datasets. A plot depicting the breakdown of missingness is presented here.
Baseline regression model (BMI only predictor)
Baseline model conducted to compare whether predictors significantly improve predictive power.
The regression resulted in a small drop in deviance, indicating that the BMI category adds very little predictive value over simply assuming the overall diabetes prevalence.
Null deviance = 16.75 (baseline fit).
Residual deviance = 15.11 (with BMI).
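The deviance drop can be tested formally with a likelihood-ratio chi-square. Assuming 3 degrees of freedom for the four BMI levels (the df is an assumption here, since some levels may be empty in the 14 complete cases):

```r
# Likelihood-ratio check of the BMI-only model against the null model,
# using the deviances reported above; df = 3 is assumed for 4 BMI levels.
null_dev  <- 16.75
resid_dev <- 15.11
lrt       <- null_dev - resid_dev                      # 1.64
p_value   <- pchisq(lrt, df = 3, lower.tail = FALSE)   # ~0.65
p_value    # well above 0.05: BMI alone adds little predictive value
```

The non-significant likelihood-ratio test quantifies the statement above that BMI contributes little beyond the overall prevalence.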
Firth (penalized) regression
Firth penalized regression (a frequentist approach) was considered to handle quasi-separation (D'Angelo and Ran, 2025); it uses the Jeffreys prior for bias correction. Unlike Bayesian logistic regression, it provides no posterior distribution and involves no MCMC sampling.
Unexpected reports, patterns or anomalies in the raw data
Issue of quasi-complete separation (9799 observations dropped)
Reduced sample size with reduced number of complete cases (n=14).
The model is overfitted to this subset and cannot be generalized.
Huge coefficients (e.g., 94, –50, 73) and the tiny residual deviance suggest perfect separation and sparse data in some categories with very few observations, resulted in imbalance in the outcome (very few cases of 0 or 1).
Logistic regression cannot estimate stable coefficients when predictors perfectly classify the outcome.
Firth regression dealt with the quasi-separation, keeping the coefficients finite, but with the reduced sample size (n = 14) the estimates are highly uncertain, with wide confidence intervals, so no strong claims about predictor effects can be made.
Multivariate, non-monotone (arbitrary) missingness was observed, with a connected pattern across all variables in some cases. The issue is crucial because, to estimate a correlation coefficient between two variables, they must be connected, either directly through a set of cases with scores on both, or indirectly through their relation to a third, connected set of data.
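One way to inspect whether the missingness is connected is to tabulate the distinct row patterns of missing cells, a base-R analogue of what mice::md.pattern() reports. Sketched here on a small mock data frame with non-monotone missingness:

```r
# Counting distinct missingness patterns on a mock data frame;
# TRUE marks a missing cell, FALSE an observed one.
mock <- data.frame(BMI      = c(NA, 21, NA, 30),
                   Age      = c(40, NA, 35, NA),
                   Diabetes = c("Yes", NA, NA, "No"))
patterns <- apply(is.na(mock), 1, paste, collapse = "")
table(patterns)   # one row-pattern per combination of observed/missing cells
```

If every pair of variables shares at least one pattern in which both are observed (directly or through a chain of such patterns), the data are connected and pairwise associations remain estimable.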
Model interpretations:
- The 14 non-missing cases cannot be trusted (small sample), and the data show a quasi-separation problem.
- The model with all predictors and the baseline model both produced unstable, extreme estimates whose standard errors are not meaningful.
- Adding more predictors makes the deviance drop, but this indicates overfitting/separation, not true explanatory power.
- The BMI-only model contributes very little to explaining the response variable. Race and gender make the models appear stronger, but with the small sample (n = 14) and complete separation, the results cannot be generalized.
Considering these data anomalies, we decided to retain the full sample (N ≈ 9,813) and to address the small complete-case sample and quasi-separation by conducting Multivariate Imputation by Chained Equations (MICE).
Multivariate Imputation by Chained Equations (MICE)
Bayesian Approach Buuren and Groothuis-Oudshoorn (2011)
Multiple imputation (MI) is an alternative Bayesian-motivated approach for datasets with few complete cases.
Flatness of the density, heavy tails, non-zero peakedness, skewness, and multimodality do not hamper the good performance of multiple imputation for the mean structure in samples n > 400, even for high percentages (75%) of missing data in one variable (van Buuren, 2012).
Multiple imputation (MI) uses the popular mice package in R and adds sampling variability to the imputations.
Iterative Imputation (MICE) imputes missing values of one variable at a time, using regression models based on the other variables in the dataset.
This is a chain process, with each imputed variable becoming a predictor for the subsequent imputation and the entire process is repeated multiple times to create several complete datasets, each reflecting different possibilities for the missing data.
Each variable is imputed using its own appropriate univariate regression model.
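The chaining idea above can be illustrated with a deterministic toy loop on simulated data. Unlike mice, this sketch omits the proper sampling variability (posterior draws or predictive mean matching); it only shows each variable being re-imputed from a regression on the other, cycle after cycle:

```r
# Toy sketch of chained-equations imputation on two variables; this is an
# illustration of the chaining only, not the mice algorithm itself.
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
x[1:10]  <- NA                     # non-overlapping missingness in x and y
y[11:20] <- NA
xi <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)   # crude starting fill
yi <- ifelse(is.na(y), mean(y, na.rm = TRUE), y)
for (iter in 1:5) {
  xi[is.na(x)] <- fitted(lm(xi ~ yi))[is.na(x)]    # impute x from y
  yi[is.na(y)] <- fitted(lm(yi ~ xi))[is.na(y)]    # then y from updated x
}
cor(xi, yi)   # imputations settle near the joint structure of the data
```

mice replaces the deterministic fitted() step with draws that include residual noise (e.g., PMM), which is what preserves the variability of the imputed values.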
Code
## Multiple Imputation performed
# Subset variables for imputation in analytic_data df
library(dplyr)
library(ggplot2)
library(mice)
library(VIM)
library(janitor)
# 1. Select variables for imputation
vars <- c("ID", "BMI", "Age", "Gender", "Race", "PSU", "Strata", "Weight", "Diabetes")
analytic_data <- merged_data[, vars]
glimpse(merged_data)
# 2. Run mice to create 5 imputed datasets
imputed_data <- mice(
  analytic_data,
  m = 5,            # number of imputed datasets
  method = 'pmm',   # predictive mean matching
  seed = 123
)
# Extract the first completed dataset for the summaries below
Imputed_data1 <- complete(imputed_data, 1)
ID BMI Age Gender
Min. :73557 Underweight : 218 Min. : 0.00 Male :4831
1st Qu.:76092 Normal weight:5107 1st Qu.:10.00 Female:4982
Median :78643 Overweight :1831 Median :27.00
Mean :78645 Obese :2657 Mean :31.63
3rd Qu.:81191 3rd Qu.:52.00
Max. :83731 Max. :80.00
Race PSU Strata
Mexican American :1685 Min. :1.000 Min. :104.0
Other Hispanic : 930 1st Qu.:1.000 1st Qu.:107.0
Non-Hispanic White :3538 Median :1.000 Median :111.0
Non-Hispanic Black :2198 Mean :1.483 Mean :110.9
Other Race - Including Multi-Racial:1462 3rd Qu.:2.000 3rd Qu.:115.0
Max. :2.000 Max. :118.0
Weight Diabetes
Min. : 3748 Yes:7252
1st Qu.: 13187 No :2561
Median : 20965
Mean : 31713
3rd Qu.: 37922
Max. :171395
Code
colSums(is.na(Imputed_data1))
ID BMI Age Gender Race PSU Strata Weight
0 0 0 0 0 0 0 0
Diabetes
0
Code
plot_intro(Imputed_data1, title="Figure 8. Structure of variables and missing observations.")
Code
plot_bar(Imputed_data1, title ="Figure 9. Frequency plots of categorical variables.")
Code
plot_correlation(na.omit(Imputed_data1[, c("BMI", "Diabetes")]), maxcat=5L, title ="Figure")
# Cross-tabulation
# BMI vs Diabetes
tab_BMI <- table(Imputed_data1$BMI, Imputed_data1$Diabetes)
print(tab_BMI)
Yes No
Underweight 159 59
Normal weight 3760 1347
Overweight 1342 489
Obese 1991 666
Code
prop.table(tab_BMI, 1) * 100   # row percentages
Yes No
Underweight 72.93578 27.06422
Normal weight 73.62444 26.37556
Overweight 73.29328 26.70672
Obese 74.93414 25.06586
Code
# Gender vs Diabetes
tab_gender <- table(Imputed_data1$Gender, Imputed_data1$Diabetes)
prop.table(tab_gender, 1) * 100
Yes No
Male 72.92486 27.07514
Female 74.84946 25.15054
Code
# Race vs Diabetes
tab_race <- table(Imputed_data1$Race, Imputed_data1$Diabetes)
prop.table(tab_race, 1) * 100
Yes No
Mexican American 71.03858 28.96142
Other Hispanic 74.40860 25.59140
Non-Hispanic White 76.17298 23.82702
Non-Hispanic Black 72.47498 27.52502
Other Race - Including Multi-Racial 73.52941 26.47059
Code
# Age vs Diabetes
tab_age <- table(Imputed_data1$Age, Imputed_data1$Diabetes)
head(prop.table(tab_age, 1) * 100)
# A tibble: 8 × 4
# Groups: BMI [4]
BMI Diabetes Count Percent
<fct> <fct> <int> <dbl>
1 Underweight Yes 159 72.9
2 Underweight No 59 27.1
3 Normal weight Yes 3760 73.6
4 Normal weight No 1347 26.4
5 Overweight Yes 1342 73.3
6 Overweight No 489 26.7
7 Obese Yes 1991 74.9
8 Obese No 666 25.1
Code
# 6. Frequency tables for categorical variables
categorical_vars <- c("BMI", "Gender", "Race", "Diabetes")
for (var in categorical_vars) {
  cat("\nFrequency table for", var, ":\n")
  print(table(Imputed_data1[[var]]))
  print(round(prop.table(table(Imputed_data1[[var]])), 3))
}
Frequency table for BMI :
Underweight Normal weight Overweight Obese
218 5107 1831 2657
Underweight Normal weight Overweight Obese
0.022 0.520 0.187 0.271
Frequency table for Gender :
Male Female
4831 4982
Male Female
0.492 0.508
Frequency table for Race :
Mexican American Other Hispanic
1685 930
Non-Hispanic White Non-Hispanic Black
3538 2198
Other Race - Including Multi-Racial
1462
Mexican American Other Hispanic
0.172 0.095
Non-Hispanic White Non-Hispanic Black
0.361 0.224
Other Race - Including Multi-Racial
0.149
Frequency table for Diabetes :
Yes No
7252 2561
Yes No
0.739 0.261
Code
# 7. Summary statistics for continuous variables
continuous_vars <- c("Age")
for (var in continuous_vars) {
  cat("\nSummary statistics for", var, ":\n")
  print(summary(Imputed_data1[[var]]))
  print(paste("SD:", round(sd(Imputed_data1[[var]]), 2)))
}
Summary statistics for Age :
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 10.00 27.00 31.63 52.00 80.00
[1] "SD: 24.4"
Code
# 8. Bar plots for categorical variables
for (var in categorical_vars) {
  p <- ggplot(Imputed_data1, aes_string(x = var)) +
    geom_bar(fill = "steelblue") +
    labs(title = paste("Bar plot of", var), y = "Count") +
    theme_minimal()
  print(p)
}
Code
# 9. Histograms for continuous variables
for (var in continuous_vars) {
  p <- ggplot(Imputed_data1, aes_string(x = var)) +
    geom_histogram(fill = "skyblue", color = "black", bins = 30) +
    labs(title = paste("Histogram of", var), y = "Frequency") +
    theme_minimal()
  print(p)
}
Code
# 10. Scatter plot example (BMI vs Age)
ggplot(Imputed_data1, aes(x = Age, y = BMI, color = BMI)) +
  geom_point(alpha = 0.6) +
  labs(title = "Scatter plot of BMI vs Age", y = "BMI", x = "Age") +
  theme_minimal()
Code
# 11. Relative breakdown of Diabetes by BMI
ggplot(breakdown_BMI, aes(x = BMI, y = Percent, fill = Diabetes)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Relative Breakdown of Diabetes by BMI Category",
       x = "BMI Category", y = "Proportion") +
  theme_minimal()
Code
# 12. Crosstab with percentages
Imputed_data1 %>%
  tabyl(Diabetes, BMI) %>%
  adorn_percentages("col")
Diabetes Underweight Normal weight Overweight Obese
Yes 0.7293578 0.7362444 0.7329328 0.7493414
No 0.2706422 0.2637556 0.2670672 0.2506586
Code
# 13. Margin plot for BMI vs Diabetes
marginplot(
  Imputed_data1[, c("BMI", "Diabetes")],
  col = mdc(1:2, trans = FALSE),
  cex = 1.2,
  cex.lab = 1.2,
  cex.numbers = 1.3,
  pch = 19
)
Results from MICE:
MI resulted in 9813 observations with no NAs.
A bar plot on missingness is presented below.
A heatmap of correlation between BMI and Diabetes status reported no strong linear association between variables.
Chi-square p-value = 0.5461, which is > 0.05 revealing no evidence of association.
The imputed-data check in the marginal plot shows that the distribution of the imputed points is consistent with the observed data.
Comparison of multiple imputation and Bayesian data augmentation
Multiple imputation:
- frequentist approach; requires no priors; moderate flexibility
- handles missing values first by imputation, then performs the regression analysis and pools the results
- uncertainty is propagated after the analysis, at the pooling stage
Bayesian data augmentation:
- performs missing-data imputation and regression model fitting simultaneously, on the data with missingness
- MCMC draws samples from the joint posterior of the regression parameters and missing values, yielding complete datasets from which posterior means, credible intervals, and probabilities are extracted
- priors shrink extreme estimates back toward plausible values
- uncertainty in the missing values is fully propagated through the model, which naturally handles small or sparse datasets and separation problems
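In the multiple-imputation workflow, the per-dataset analyses are combined with Rubin's rules, which mice::pool() applies internally. A sketch for a single coefficient, with made-up per-dataset estimates and standard errors:

```r
# Rubin's rules for pooling across m = 5 imputed datasets (what
# mice::pool() does internally); the numbers here are hypothetical.
est <- c(0.42, 0.45, 0.40, 0.44, 0.43)   # per-dataset coefficient estimates
se  <- c(0.10, 0.11, 0.10, 0.12, 0.10)   # per-dataset standard errors
m   <- length(est)
qbar <- mean(est)                 # pooled point estimate
W    <- mean(se^2)                # within-imputation variance
B    <- var(est)                  # between-imputation variance
Tvar <- W + (1 + 1/m) * B         # total variance
c(estimate = qbar, se = sqrt(Tvar))
```

The pooled standard error exceeds the average within-dataset standard error precisely because the between-imputation term carries the extra uncertainty due to the missing data.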
# Log-odds (link)
Imputed_data1$logit <- predict(m_imp, type = "link")   # log(odds)
# Probability
Imputed_data1$prob <- exp(Imputed_data1$logit) / (1 + exp(Imputed_data1$logit))
# Plot predicted probability vs Age
ggplot(Imputed_data1, aes(x = Age, y = prob)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess") +
  labs(x = "Age", y = "Predicted Probability of Diabetes")
Findings from regression of MI data set
MLR on imputed data (Frequentist approach)
The relationship between Age and the log-odds of diabetes is roughly, though not perfectly, linear, which is acceptable for the logistic regression assumptions.
The generalized variance inflation factors (GVIF-adjusted) report no collinearity between predictors (GVIF ≈ 1.0–1.04), so the model can be run without removing, dropping, or combining variables.
From Cook's distance and influential-point diagnostics, most data points are safe and do not influence the model. With ~9,813 cases, the cutoff is ≈ 0.0004. A cluster at high leverage shows unusual predictor values but not high influence. A few points above the Cook's distance cutoff are worth checking individually, but pose no major threat to model stability; no outliers were detected among the 9,813 cases.
The Hosmer–Lemeshow (H–L) test at alpha = 0.05 gives p < 0.001, so we find that our logistic regression model does not fit the data well.
The graph shows residuals vs fitted values for the imputed-data model.
Results from Bayesian Data Augmentation and logistic regression
We incorporate prior knowledge that higher BMI increases the odds of diabetes.
We use priors for the Bayesian logistic regression and compare models fitted with different priors.
Prior (intercept): an intercept prior informed by this study's data.
Prior (coefficients): BMI, Age, Gender.
Weakly informative prior: βBMI ∼ N(0, 2.5²).
A common approach is to use a normal distribution, βBMI ∼ N(μ, σ²), for the regression coefficient.
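What a Normal(0, 2.5) prior on a log-odds coefficient implies on the odds-ratio scale can be checked directly, which is a useful sanity check before fitting:

```r
# Central 95% interval implied by a Normal(0, 2.5) coefficient prior,
# transformed to the odds-ratio scale.
or_interval <- exp(qnorm(c(0.025, 0.975), mean = 0, sd = 2.5))
round(or_interval, 2)   # roughly 0.01 to 134: wide, but rules out absurd effects
```

The interval spans several orders of magnitude, so the prior is weakly informative: it regularizes the separation-prone coefficients without imposing a strong substantive belief about BMI's effect.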
# Residual vs fitted for imputed data
plot(m_imp$fitted.values, resid(m_imp),
     xlab = "Fitted values",
     ylab = "Residuals",
     main = "Residuals vs Fitted",
     pch = 19, col = "blue")
abline(h = 0, col = "red", lwd = 2)
The Hosmer–Lemeshow test was conducted to assess the goodness of fit of the multivariable logistic regression model, along with adjR2(m_imp), a chi-square test, and visualization of the model (fitted vs residual values).
To overcome the quasi-separation issue in the data, a Firth penalized regression model was fitted; its summary, based on only 14 complete observations, is presented below.
Next, to deal with the small sample size, imputation (MICE) was conducted along with the regression predicting Diabetes ~ BMI + Age + Gender + Race.
A summary and visualization of one of the 5 datasets produced by MICE is presented here, along with the predicted values from the imputed-data model (m_imp) and plots of the cross-tabulations between the predictors and the response variable (Diabetes).
Code
# Frequentist logistic regression on raw data - Firth logistic regression (penalized regression)
library(logistf)
m_firth <- logistf(Diabetes ~ BMI + Age + Gender + Race,
                   data = merged_data)
summary(m_firth)
logistf(formula = Diabetes ~ BMI + Age + Gender + Race, data = merged_data)
Model fitted by Penalized ML
Coefficients:
coef se(coef) lower 0.95
(Intercept) 2.2909182 2.0520292 -1.7059362
BMIOverweight 2.5151817 1.9838969 -1.2146946
BMIObese -0.4519637 1.8778881 -5.4843160
Age -0.2545704 0.1911568 -1.4579663
GenderFemale -2.1675543 2.0190594 -17.0316807
RaceOther Hispanic 0.9518795 1.9225099 -12.0194116
RaceNon-Hispanic White -2.5671499 2.2189914 -21.6563170
RaceNon-Hispanic Black 3.3280688 2.2005227 -0.4678607
RaceOther Race - Including Multi-Racial 1.3072954 2.0609641 -2.6483209
upper 0.95 Chisq p method
(Intercept) 16.1697706 1.25690744 0.26223728 2
BMIOverweight 21.9385716 1.68994830 0.19360778 2
BMIObese 3.4687761 0.05538832 0.81393924 2
Age 0.1148471 1.81192970 0.17827692 2
GenderFemale 5.9967059 0.69556457 0.40427809 2
RaceOther Hispanic 10.8617660 0.20246895 0.65273532 2
RaceNon-Hispanic White 1.6700624 1.36287409 0.24304000 2
RaceNon-Hispanic Black 25.3575974 2.86971841 0.09026066 2
RaceOther Race - Including Multi-Racial 21.7820826 0.39215076 0.53117103 2
Method: 1-Wald, 2-Profile penalized log-likelihood, 3-None
Likelihood ratio test=6.066658 on 8 df, p=0.6397653, n=14
Wald test = 5.257182 on 8 df, p = 0.7297677
Code
# Imputed data plots ## pred_prob_imputed #
Imputed_data1 <- Imputed_data1 %>%
  mutate(pred_prob_imputed = predict(m_imp, type = "response"))  # predicted probabilities
Imputed_plot <- Imputed_data1 %>%
  select(BMI, pred_prob_imputed) %>%
  mutate(Source = "Imputed")
# Rename probability column to common name
Imputed_plot <- Imputed_plot %>%
  rename(Pred_Prob = pred_prob_imputed)
ggplot(Imputed_data1, aes(x = BMI, y = pred_prob_imputed, fill = BMI)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1) +
  labs(x = "BMI Category", y = "Predicted Probability of Diabetes",
       title = "Predicted Diabetes Probability by BMI Category") +
  theme_minimal()
Code
merged_data_clean <- merged_data %>%
  filter(!is.na(BMI), !is.na(Diabetes))
ggplot(merged_data_clean, aes(x = BMI, fill = factor(Diabetes))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "BMI Category", y = "Proportion with Diabetes", fill = "Diabetes",
       title = "Observed Diabetes Proportions by BMI Category") +
  theme_minimal()
Code
ggplot(Imputed_data1, aes(x = Race, y = pred_prob_imputed, fill = Race)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1) +
  labs(x = "Race", y = "Predicted Probability of Diabetes",
       title = "Predicted Diabetes Probability by Race") +
  theme_minimal()
Code
merged_data_clean_Race <- merged_data %>%
  filter(!is.na(Race), !is.na(Diabetes))
# Note: plot the Race-filtered data, not the BMI-filtered data frame
ggplot(merged_data_clean_Race, aes(x = Race, fill = factor(Diabetes))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Race", y = "Proportion with Diabetes", fill = "Diabetes",
       title = "Observed Diabetes Proportions by Race") +
  theme_minimal()
Proportions of diabetes status by age group (<40 vs. >=40 years) are tabulated below.
Code
ggplot(merged_data, aes(x = Age, y = Diabetes)) + geom_jitter(size = 0.2)
Code
ggplot(Imputed_data1, aes(x = Age, y = Diabetes)) + geom_jitter(size = 0.2)
Code
# Create age groups and a contingency table with Diabetes
Imputed_data1$Age_group <- ifelse(Imputed_data1$Age < 40, "<40", ">=40")
tab_age <- table(Imputed_data1$Age_group, Imputed_data1$Diabetes)
prop_age <- prop.table(tab_age, 1) * 100
tab_age
Yes No
<40 4414 1691
>=40 2838 870
Code
prop_age
Yes No
<40 72.30139 27.69861
>=40 76.53722 23.46278
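As a quick arithmetic check on the row percentages above (the analysis itself uses R's `prop.table(tab_age, 1)`; this is a language-agnostic sketch in Python using the printed counts):

```python
# Counts from the age-group x diabetes contingency table (tab_age)
counts = {"<40": {"Yes": 4414, "No": 1691},
          ">=40": {"Yes": 2838, "No": 870}}

# Row percentages: each cell divided by its row total, times 100
row_pct = {
    grp: {k: 100 * v / sum(cells.values()) for k, v in cells.items()}
    for grp, cells in counts.items()
}

print(round(row_pct["<40"]["Yes"], 5))   # 72.30139
print(round(row_pct[">=40"]["Yes"], 5))  # 76.53722
```

Each row sums to 100%, matching the `prop_age` output.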
Code
# Convert table to data frame
df_age <- as.data.frame(tab_age)
names(df_age) <- c("Age_group", "Diabetes", "Count")  # rename columns
Code
## Reference: Gelman et al., 2008 - "Weakly informative priors": Normal(0, 2.5) for
## coefficients (b) and Normal(0, 5) for the intercept as default weakly informative
## priors for logistic regression
### Bayesian logistic regression
library(brms)
priors <- c(
  set_prior("normal(0, 2.5)", class = "b"),        # for coefficients
  set_prior("normal(0, 5)", class = "Intercept")   # for intercept
)
formula_bayes <- bf(Diabetes ~ Age + BMI + Gender + Race)
Diabetes_prior <- brm(
  formula = formula_bayes,
  data = Imputed_data1,
  family = bernoulli(link = "logit"),  # logistic regression
  prior = priors,
  chains = 4,
  iter = 2000,
  seed = 123,
  control = list(adapt_delta = 0.95)
)
Family: bernoulli
Links: mu = logit
Formula: Diabetes ~ Age + BMI + Gender + Race
Data: Imputed_data1 (Number of observations: 9813)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept -0.77 0.16 -1.09 -0.44 1.00
Age -0.00 0.00 -0.01 -0.00 1.00
BMINormalweight 0.03 0.16 -0.27 0.34 1.00
BMIOverweight 0.08 0.16 -0.24 0.40 1.00
BMIObese 0.03 0.16 -0.29 0.35 1.00
GenderFemale -0.09 0.05 -0.18 -0.00 1.00
RaceOtherHispanic -0.15 0.09 -0.34 0.03 1.00
RaceNonMHispanicWhite -0.21 0.07 -0.34 -0.08 1.00
RaceNonMHispanicBlack -0.06 0.07 -0.20 0.09 1.00
RaceOtherRaceMIncludingMultiMRacial -0.11 0.08 -0.26 0.05 1.00
Bulk_ESS Tail_ESS
Intercept 1834 1902
Age 4495 2737
BMINormalweight 1820 1960
BMIOverweight 1899 2263
BMIObese 1884 1952
GenderFemale 3564 2756
RaceOtherHispanic 2685 2528
RaceNonMHispanicWhite 2156 2917
RaceNonMHispanicBlack 2444 2590
RaceOtherRaceMIncludingMultiMRacial 2311 2896
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
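The Rhat values compared above can be illustrated with a toy computation. This is a deliberately simplified between/within-chain version of the potential scale reduction factor (without the chain splitting Stan performs), with made-up "chains":

```python
from statistics import mean, variance

def rhat(chains):
    """Simplified potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    w = mean(variance(c) for c in chains)        # within-chain variance
    b = n * variance([mean(c) for c in chains])  # between-chain variance
    var_plus = (n - 1) / n * w + b / n           # pooled variance estimate
    return (var_plus / w) ** 0.5

# Two toy "chains" exploring slightly shifted regions: Rhat exceeds 1
print(round(rhat([[1, 2, 3, 4], [2, 3, 4, 5]]), 4))  # 1.0247
```

When the chains agree (identical samples), the between-chain term vanishes and the statistic drops below the mixed-chain value, which is the signal of convergence used above (Rhat = 1.00 for all parameters).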
Code
plot(Diabetes_prior)
Code
pp_check(Diabetes_prior)
Code
## Draws: generate fitted draws directly with brms
fitted_draws <- fitted(
  Diabetes_prior,
  newdata = Imputed_data1,
  summary = FALSE,  # gives all posterior draws instead of a summary
  nsamples = 100    # limit to 100 draws
)
# Convert to long format manually
fitted_long <- data.frame(
  BMI = rep(Imputed_data1$BMI, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
### BMI: plot the fitted lines
ggplot(fitted_long, aes(x = BMI, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")
Code
### Age
fitted_long <- data.frame(
  Age = rep(Imputed_data1$Age, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
ggplot(fitted_long, aes(x = Age, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")
Code
### Race
fitted_long <- data.frame(
  Race = rep(Imputed_data1$Race, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
ggplot(fitted_long, aes(x = Race, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")
Code
### Gender
fitted_long <- data.frame(
  Gender = rep(Imputed_data1$Gender, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
ggplot(fitted_long, aes(x = Gender, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")
Code
### Age cut by 10 years and Diabetes plot (imputed data)
Imputed_data1 %>%
  mutate(age_bracket = cut(Age, breaks = seq(10, 100, by = 10))) %>%
  group_by(age_bracket) %>%
  summarise(Diabetes = mean(Diabetes == "Yes")) %>%
  ggplot(aes(x = age_bracket, y = Diabetes)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
Code
### Predicted values from 100 draws: simulated predictions (binary diabetes outcome)
pred <- posterior_predict(Diabetes_prior, newdata = Imputed_data1, draws = 100)
# Data frame for summarizing
pred_df <- as.data.frame(t(pred))
# Proportion of diabetes = 1 per draw
prop_diabetes <- colMeans(pred_df == 1)
prop_df <- tibble(
  draw = 1:length(prop_diabetes),
  proportion_Diabetes = prop_diabetes  # proportion of Diabetes per draw
)
library(ggplot2)
ggplot(prop_df, aes(x = proportion_Diabetes)) +
  geom_histogram(color = "white")
Bayesian logistic regression model: a weakly informative prior was used; compilation, iterations, and posterior draws used NUTS sampling. Fitted draws from the model posterior (n = 100) were analyzed, and estimates and Rhat values were checked for convergence. The plots presented below include a histogram of predicted values (n = 100 draws), showing the simulated proportion of diabetes status, and a scatterplot of the proportion of diabetes grouped by age.
We created a posterior model from the 100 posterior draws, analyzed our simulated prior model, and plotted posterior predicted values of diabetes against Age, BMI, Gender, and Race.
Code
library(brms)
library(GGally)
# Simulate the model
Diabetes_model_1 <- update(
  Diabetes_prior,
  sample_prior = "yes"  # includes priors + data likelihood
)
# BMI: posterior fitted values (probabilities of Diabetes)
fitted_draws <- fitted(
  Diabetes_model_1,  # <-- use posterior model here
  newdata = Imputed_data1,
  summary = FALSE,   # full posterior draws
  nsamples = 100     # sample 100 posterior draws
)
# Reshape into long format
fitted_long <- data.frame(
  BMI = rep(Imputed_data1$BMI, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
# Plot posterior fitted lines (BMI)
library(ggplot2)
ggplot(fitted_long, aes(x = BMI, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes",
       title = "Posterior fitted draws by BMI")
Code
# Age: reshape into long format
fitted_long <- data.frame(
  Age = rep(Imputed_data1$Age, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
# Plot posterior fitted lines
library(ggplot2)
ggplot(fitted_long, aes(x = Age, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes",
       title = "Posterior fitted draws by Age")
Code
# Race: reshape into long format
fitted_long <- data.frame(
  Race = rep(Imputed_data1$Race, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
# Plot posterior fitted lines
library(ggplot2)
ggplot(fitted_long, aes(x = Race, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes",
       title = "Posterior fitted draws by Race")
Code
# Gender: reshape into long format
fitted_long <- data.frame(
  Gender = rep(Imputed_data1$Gender, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
# Plot posterior fitted lines
library(ggplot2)
ggplot(fitted_long, aes(x = Gender, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes",
       title = "Posterior fitted draws by Gender")
We performed Bayesian logistic regression to model the probability of diabetes as a function of age, BMI, gender, and race.
We used weakly informative normal priors (Gosho et al. 2025) on all regression coefficients. Models were fit using four MCMC chains with 2000 iterations each, including 1000 warm-up iterations. Convergence was assessed using Rhat values, effective sample size, and trace plots. Posterior predictive checks were used to evaluate model fit, and coefficients were exponentiated to report odds ratios with 95% credible intervals.
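Exponentiating log-odds coefficients works the same way in either framework. As an illustration (sketched in Python for arithmetic only), take the glm estimate for Non-Hispanic White reported in the summaries below (estimate −0.212663, SE 0.067746) with a normal-approximation 95% interval:

```python
from math import exp

coef, se = -0.212663, 0.067746  # glm log-odds estimate and SE (Non-Hispanic White)

# Odds ratio and Wald-type 95% confidence limits on the odds scale
odds_ratio = exp(coef)
ci_low = exp(coef - 1.96 * se)
ci_high = exp(coef + 1.96 * se)

print(round(odds_ratio, 2), round(ci_low, 2), round(ci_high, 2))  # 0.81 0.71 0.92
```

This reproduces the OR = 0.81 (95% CI: 0.71–0.92) reported for Non-Hispanic Whites versus the Mexican American reference group.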
The cluster of points representing the higher probability of diabetes appears to be denser among individuals in the middle to older Age ranges (e.g., roughly from 40 to 80 years old), compared to the younger Age ranges, although diabetes is present even at younger Ages.
Comparing Models
Firth penalized logistic regression on raw data
Multivariate logistic regression on imputed dataset (MI + MLR)
Bayesian logistic regression on imputed data. The spread of these lines indicates the variability, or uncertainty, in the predicted probabilities within each BMI group (posterior model). The plots visually illustrate the well-established relationship between BMI and the predicted probability of diabetes, with predicted risk increasing as BMI moves from normal weight to overweight and obese categories, although the BMI coefficients did not reach statistical significance in our models.
All Results Summarized
Sample Characteristics (merged data)
The sample comprised 9,813 participants with an age range of 0 to 80 years (mean = 31.6 years, median = 27 years, IQR = 10–52 years), including 4,831 males (49.2%) and 4,982 females (50.8%). The largest groups were Non-Hispanic White (n = 3,538, 36.0%) and Non-Hispanic Black (n = 2,198, 22.4%), followed by Mexican American (n = 1,685, 17.2%), Other Hispanic (n = 930, 9.5%), and Other Race/Multiracial (n = 1,462, 14.9%).
Body mass index (BMI) categories: 132 participants (1.3%) were classified as underweight, 2,167 (22.1%) as normal weight, 595 (6.1%) as overweight, and 629 (6.4%) as obese; however, BMI data were missing for 6,290 participants (64.1%).
A total of 553 participants (5.6%) reported having diabetes, while 169 (1.7%) reported no diabetes. A substantial proportion (9,091 participants, 92.6%) had missing diabetes data.
Survey design variables included PSU (range = 1–2), strata (range = 104–118), and sampling weights (mean = 31,713; range = 3,748–171,395).
The imputed sample included 9,813 participants with age range of 0 to 80 years (mean = 31.6 years, median = 27 years, IQR = 10–52 years). Gender distribution remained nearly balanced with 4,831 males (49.2%) and 4,982 females (50.8%).
Race/ethnicity composition was Non-Hispanic White (n = 3,538, 36.0%), Non-Hispanic Black (n = 2,198, 22.4%), Mexican American (n = 1,685, 17.2%), Other Hispanic (n = 930, 9.5%), and Other Race/Multiracial (n = 1,462, 14.9%).
After imputation, the body mass index (BMI) distribution comprised 218 participants (2.2%) classified as underweight, 5,107 (52.0%) as normal weight, 1,831 (18.7%) as overweight, and 2,657 (27.1%) as obese.
For diabetes status, 7,252 participants (73.9%) were classified as having diabetes and 2,561 (26.1%) as not having diabetes. The imputation model included predicted log-odds of diabetes (mean logit = −1.05, range = −1.43 to −0.69) and corresponding predicted probabilities (mean = 0.26, range = 0.19–0.33), which were used to generate imputed diabetes outcomes.
Survey design variables were preserved, including PSU (range = 1–2), strata (range = 104–118), and sampling weights (mean = 31,713; range = 3,748–171,395).
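The imputed prevalence and the imputation model's predicted probabilities can be cross-checked against the summary tables later in the document (`Diabetes_num` mean = 0.739; mean logit = −1.0471). A quick arithmetic sketch in Python (note the inverse-logit of the mean log-odds only approximates the mean probability, but the two are close here):

```python
from math import exp

yes, total = 7252, 9813   # imputed diabetes "Yes" count and sample size
mean_logit = -1.0471      # mean predicted log-odds from the imputation model

prevalence = yes / total
inv_logit = 1 / (1 + exp(-mean_logit))  # inverse-logit of the mean log-odds

print(round(prevalence, 3))  # 0.739
print(round(inv_logit, 2))   # 0.26, close to the reported mean probability
```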
A multivariable logistic regression model examined the association between demographic and anthropometric characteristics and diabetes status in the imputed dataset (N = 9,813).
Age was significantly associated with diabetes, with each additional year corresponding to a small reduction in odds (OR = 0.995, 95% CI: 0.993–0.997, p < 0.001).
Females had lower odds of diabetes compared with males (OR = 0.91, 95% CI: 0.84–1.00, p = 0.044).
Compared with Mexican Americans (reference), Non-Hispanic Whites had significantly lower odds of diabetes (OR = 0.81, 95% CI: 0.71–0.92, p = 0.002). Other Hispanic, Non-Hispanic Black, and Other Race/Multiracial groups were not significantly different from the reference group (all p > 0.09).
BMI categories were not significantly associated with diabetes (all p > 0.65).
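The per-year odds ratio for age above is very close to 1, which can understate the effect; compounding the glm log-odds coefficient (−0.004728, reported in the model summaries below) over a decade gives a more readable effect size (a quick Python sketch):

```python
from math import exp

beta_age = -0.004728  # log-odds change per additional year of age (glm fit)

or_per_year = exp(beta_age)         # odds ratio per year
or_per_decade = exp(10 * beta_age)  # equivalently or_per_year ** 10

print(round(or_per_year, 3), round(or_per_decade, 3))  # 0.995 0.954
```

So the reported OR = 0.995 per year corresponds to roughly a 4.6% reduction in odds per decade of age.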
Model diagnostics:
ANOVA (Type II sequential tests): Age (χ²(1) = 31.17, p < 0.001), Gender (χ²(1) = 3.93, p = 0.048), and Race (χ²(4) = 12.24, p = 0.016) significantly improved model fit, whereas BMI (χ²(3) = 0.72, p = 0.868) did not.
Goodness-of-fit: The Hosmer–Lemeshow test indicated poor model fit (χ²(8) = 12,190, p < 0.001).
Predictor independence: Pearson’s chi-squared test across BMI categories was nonsignificant (p = 0.546), indicating no strong association between BMI and diabetes.
Multicollinearity: Variance inflation factors (VIFs) were all close to 1, indicating no collinearity issues.
Explained variance: The adjusted R² for the model was very low (Adj. R² = 0.003), indicating that the predictors explained <1% of the variance in diabetes status.
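The null and residual deviances in the glm summary below (11,267 and 11,219) give a McFadden pseudo-R² of the same order as the adjusted R², confirming the low explained variance (a quick check in Python):

```python
null_dev, resid_dev = 11267, 11219  # from the glm(Diabetes ~ ...) summary

# McFadden pseudo-R^2: 1 - residual deviance / null deviance
pseudo_r2 = 1 - resid_dev / null_dev
print(round(pseudo_r2, 4))  # 0.0043
```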
Summary: Age, gender, and race/ethnicity (particularly Non-Hispanic White vs. Mexican American) were significant predictors of diabetes, while BMI categories were not.
Regression Analyses
Multiple modeling strategies, including frequentist, penalized likelihood, and Bayesian approaches, were conducted to assess predictors of diabetes.
Firth Penalized Logistic Regression
To analyze a sparse dataset (n = 14), Firth's penalized likelihood was used to reduce small-sample bias in logistic regression.
None of the predictors (BMI, age, gender, race) reached statistical significance (all p > 0.09).
Likelihood ratio and Wald tests were nonsignificant (LRT = 6.07, p = 0.64; Wald = 5.26, p = 0.73), suggesting the model provided limited explanatory power in the small sample.
Frequentist Logistic Regression (Imputed Dataset)
In the analysis of the imputed dataset (N = 9,813), ANOVA tests showed that age (OR ≈ 0.995, p < 0.001), gender (female vs. male: OR ≈ 0.91, p = 0.044), and race/ethnicity (Non-Hispanic White vs. Mexican American: OR ≈ 0.81, p = 0.002) were significantly associated with diabetes status.
BMI categories were not significant predictors. The adjusted R² was low (Adj. R² = 0.003), indicating limited variance explained.
Bayesian Logistic Regression
Bayesian estimation with weakly informative priors yielded results consistent with the frequentist model.
Posterior means and 95% credible intervals indicated that increasing age (Estimate = –0.00, 95% CrI: –0.01 to –0.00), female gender (Estimate = –0.09, 95% CrI: –0.18 to 0.00), and Non-Hispanic White ethnicity (Estimate = –0.21, 95% CrI: –0.34 to –0.08) were associated with lower odds of diabetes.
BMI categories again showed no significant associations. Convergence diagnostics were satisfactory (Rhat = 1.00 for all parameters; Bulk ESS > 1,800).
The Bayesian logistic regression model is well calibrated to reproduce the data distribution; the likelihood assumption (Bernoulli with logit link) is appropriate, and no discrepancies (e.g., gray lines shifted away from the observed density) were observed, suggesting adequate model fit.
Conclusion
Using multiple imputation (MI) allowed inclusion of all 9,813 participants, increasing power and reducing bias compared to a complete-case analysis. In epidemiologic studies, MI is critical when missingness is non-negligible, especially for outcomes like diabetes.
Across modeling frameworks, age, gender, and race/ethnicity consistently emerged as predictors of diabetes status, while BMI was not associated. The penalized regression model (n = 14) had insufficient power, whereas both frequentist and Bayesian analyses on the imputed dataset (N ≈ 9,813) showed concordant results, validating the robustness of findings.
Posterior distributions and credible intervals from the Bayesian method offered a probabilistic interpretation of effect sizes that makes uncertainty quantification more intuitive by incorporating prior knowledge.
The Firth penalized logistic regression on sparse data underscores the importance of adequate sample size for reliable inference: penalization reduces small-sample bias, but it cannot compensate for extreme sparsity.
Low adjusted R² (~0.003) and Hosmer-Lemeshow test results suggest incorporating additional behavioral, clinical, and biological variables to improve explanatory performance.
The multicollinearity assessment (all VIFs near 1) emphasizes the importance of checking collinearity so that coefficient estimates are reliable and interpretable.
Sequential deviance testing is a useful diagnostic to identify the contribution of individual predictors in logistic regression, complementing coefficient-based inference.
Handling missing data (via MI) and comparing frequentist, Bayesian, and penalized approaches strengthens the credibility of results.
Discussions
The use of multiple imputation allowed for robust analysis despite missing data, increasing power and reducing bias. Comparison of frequentist and Bayesian models demonstrated consistency in significant predictors, while Bayesian approaches provided the advantage of posterior distributions and probabilistic interpretation. The Firth penalized regression highlighted the limitations of small-sample analyses, showing wide confidence intervals and nonsignificant results when data were sparse.
References
Code for summaries and tables
Code
summary (merged_data)
ID BMI Age Gender
Min. :73557 Underweight : 132 Min. : 0.00 Male :4831
1st Qu.:76092 Normal weight:2167 1st Qu.:10.00 Female:4982
Median :78643 Overweight : 595 Median :27.00
Mean :78645 Obese : 629 Mean :31.63
3rd Qu.:81191 NA's :6290 3rd Qu.:52.00
Max. :83731 Max. :80.00
Race PSU Strata
Mexican American :1685 Min. :1.000 Min. :104.0
Other Hispanic : 930 1st Qu.:1.000 1st Qu.:107.0
Non-Hispanic White :3538 Median :1.000 Median :111.0
Non-Hispanic Black :2198 Mean :1.483 Mean :110.9
Other Race - Including Multi-Racial:1462 3rd Qu.:2.000 3rd Qu.:115.0
Max. :2.000 Max. :118.0
Weight Diabetes
Min. : 3748 Yes : 553
1st Qu.: 13187 No : 169
Median : 20965 NA's:9091
Mean : 31713
3rd Qu.: 37922
Max. :171395
Code
summary (Imputed_data1)
ID BMI Age Gender
Min. :73557 Underweight : 218 Min. : 0.00 Male :4831
1st Qu.:76092 Normal weight:5107 1st Qu.:10.00 Female:4982
Median :78643 Overweight :1831 Median :27.00
Mean :78645 Obese :2657 Mean :31.63
3rd Qu.:81191 3rd Qu.:52.00
Max. :83731 Max. :80.00
Race PSU Strata
Mexican American :1685 Min. :1.000 Min. :104.0
Other Hispanic : 930 1st Qu.:1.000 1st Qu.:107.0
Non-Hispanic White :3538 Median :1.000 Median :111.0
Non-Hispanic Black :2198 Mean :1.483 Mean :110.9
Other Race - Including Multi-Racial:1462 3rd Qu.:2.000 3rd Qu.:115.0
Max. :2.000 Max. :118.0
Weight Diabetes logit prob Diabetes_num
Min. : 3748 Yes:7252 Min. :-1.4253 Min. :0.1938 Min. :0.000
1st Qu.: 13187 No :2561 1st Qu.:-1.1612 1st Qu.:0.2384 1st Qu.:0.000
Median : 20965 Median :-1.0383 Median :0.2615 Median :1.000
Mean : 31713 Mean :-1.0471 Mean :0.2610 Mean :0.739
3rd Qu.: 37922 3rd Qu.:-0.9230 3rd Qu.:0.2843 3rd Qu.:1.000
Max. :171395 Max. :-0.6904 Max. :0.3339 Max. :1.000
pred_prob_imputed Age_group
Min. :0.1938 Length:9813
1st Qu.:0.2384 Class :character
Median :0.2615 Mode :character
Mean :0.2610
3rd Qu.:0.2843
Max. :0.3339
Code
summary(m_imp)
Call:
glm(formula = Diabetes ~ Age + Gender + Race + BMI, family = binomial,
data = Imputed_data1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.761926 0.162425 -4.691 2.72e-06
Age -0.004728 0.001011 -4.678 2.89e-06
GenderFemale -0.092730 0.046133 -2.010 0.04442
RaceOther Hispanic -0.154250 0.092549 -1.667 0.09558
RaceNon-Hispanic White -0.212663 0.067746 -3.139 0.00169
RaceNon-Hispanic Black -0.055537 0.072128 -0.770 0.44131
RaceOther Race - Including Multi-Racial -0.109388 0.080345 -1.361 0.17336
BMINormal weight 0.024339 0.156361 0.156 0.87630
BMIOverweight 0.071544 0.162658 0.440 0.66005
BMIObese 0.020287 0.161137 0.126 0.89981
(Intercept) ***
Age ***
GenderFemale *
RaceOther Hispanic .
RaceNon-Hispanic White **
RaceNon-Hispanic Black
RaceOther Race - Including Multi-Racial
BMINormal weight
BMIOverweight
BMIObese
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11267 on 9812 degrees of freedom
Residual deviance: 11219 on 9803 degrees of freedom
AIC: 11239
Number of Fisher Scoring iterations: 4
Code
coef(m_imp)
(Intercept) Age
-0.761925862 -0.004727998
GenderFemale RaceOther Hispanic
-0.092730385 -0.154250348
RaceNon-Hispanic White RaceNon-Hispanic Black
-0.212662854 -0.055536906
RaceOther Race - Including Multi-Racial BMINormal weight
-0.109388375 0.024339471
BMIOverweight BMIObese
0.071543638 0.020286793
Ali, Sakhawat, Rizwana Hussain, Rohaib A Malik, Raheema Amin, and Muhammad N Tariq. 2024. “Association of Obesity With Type 2 Diabetes Mellitus: A Hospital-Based Unmatched Case-Control Study.”Cureus 16 (1): 1–7. https://doi.org/10.7759/cureus.52728.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.”Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
National Center for Health Statistics. 1999. “Vital and Health Statistics Reports Series 2, Number 161, September 2013.” National Health and Nutrition Examination Survey: Analytic Guidelines, 1999–2010. https://wwwn.cdc.gov/nchs/data/series/sr02_161.pdf.
Chatzimichail, Theodora, and Aristides T. Hatjimihail. 2023. “A Bayesian Inference Based Computational Tool for Parametric and Nonparametric Medical Diagnosis.” Diagnostics 13 (19). https://doi.org/10.3390/diagnostics13193135.
D’Angelo, Gina, and Di Ran. 2025. “Tutorial on Firth’s Logistic Regression Models for Biomarkers in Preclinical Space.”Pharmaceutical Statistics 24 (1): 1–8. https://doi.org/10.1002/pst.2422.
Gosho, Masahiko, Ryota Ishii, Kengo Nagashima, Hisashi Noma, and Kazushi Maruo. 2025. “Determining the prior mean in Bayesian logistic regression with sparse data: a nonarbitrary approach.”Journal of the Royal Statistical Society. Series C: Applied Statistics 74 (1): 126–41. https://doi.org/10.1093/jrsssc/qlae048.
Klauenberg, Katy, Gerd Wübbeler, Bodo Mickan, Peter Harris, and Clemens Elster. 2015. “A tutorial on Bayesian Normal linear regression.”Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
Schoot, Rens van de, Sarah Depaoli, Ruth King, Bianca Kramer, Kaspar Märtens, Mahlet G. Tadesse, Marina Vannucci, et al. 2021. “Bayesian statistics and modelling.”Nature Reviews Methods Primers 1 (1): 1. https://doi.org/10.1038/s43586-020-00001-2.
van Buuren, Stef. 2012. Flexible Imputation of Missing Data. Vol. 10. Boca Raton, FL: CRC Press.